Joins for Hybrid Warehouses: Exploiting Massive Parallelism in Hadoop and Enterprise Data Warehouses

Authors

Yuanyuan Tian

Tao Zou

Fatma Özcan

Romulo Goncalves

Hamid Pirahesh

Abstract

HDFS has become an important data repository in the enterprise as the center for all business analytics, from SQL queries and machine learning to reporting. At the same time, enterprise data warehouses (EDWs) continue to support critical business analytics. This has created the need for a new generation of special federation between Hadoop-like big data platforms and EDWs, which we call the hybrid warehouse. Many applications require correlating data stored in HDFS with EDW data, such as an analysis that associates click logs stored in HDFS with sales data stored in the database. All existing solutions reach out to HDFS and read the data into the EDW to perform the joins, assuming that the Hadoop side lacks efficient SQL support. In this paper, we show that it is actually better to do most data processing on the HDFS side, provided that we can leverage a sophisticated execution engine for joins on the Hadoop side. We identify the best hybrid warehouse architecture by studying various algorithms to join database and HDFS tables. We utilize Bloom filters to minimize data movement, and exploit the massive parallelism in both systems to the fullest extent possible. We describe a new zigzag join algorithm, and show that it is a robust join algorithm for hybrid warehouses that performs well in almost all cases.
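The abstract mentions Bloom filters as a way to cut data movement between the two systems. The sketch below illustrates only that general idea in a single Python process: one side builds a Bloom filter over its join keys, and the other side uses it to discard rows that cannot match before shipping anything. It is not the paper's zigzag join. The `BloomFilter` class, the `bloom_filtered_join` function, the filter sizes, and the sample `sales` and `clicks` rows are all hypothetical and chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Small illustrative Bloom filter; the sizes are arbitrary assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive several bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def bloom_filtered_join(db_rows, hdfs_rows, key):
    """Join two row lists on `key`, pre-filtering hdfs_rows with a Bloom filter.

    Returns the joined rows and the number of hdfs_rows that passed the filter,
    i.e. the rows that would actually have to be moved.
    """
    bloom = BloomFilter()
    for row in db_rows:
        bloom.add(row[key])

    # Only rows whose key might exist on the database side are shipped.
    shipped = [row for row in hdfs_rows if bloom.might_contain(row[key])]

    # An exact hash join removes any Bloom filter false positives.
    by_key = {}
    for row in db_rows:
        by_key.setdefault(row[key], []).append(row)
    joined = [{**d, **h} for h in shipped for d in by_key.get(h[key], [])]
    return joined, len(shipped)


# Hypothetical sample data: sales in the database, click logs in HDFS.
sales = [{"item_id": 1, "amount": 30}, {"item_id": 2, "amount": 45}]
clicks = [
    {"item_id": 1, "url": "/a"},
    {"item_id": 3, "url": "/b"},
    {"item_id": 2, "url": "/c"},
    {"item_id": 4, "url": "/d"},
]

joined, shipped_count = bloom_filtered_join(sales, clicks, "item_id")
print(joined)
print(f"rows shipped after filtering: {shipped_count} of {len(clicks)}")
```

A Bloom filter never produces false negatives, so no matching row is lost. It can produce false positives, which is why the exact join step is still needed after filtering.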

Similar resources

A Data Pre-partitioning and Distribution Optimization Approach for Distributed Data Warehouses

The increasing volume of relational data calls for alternative ways to cope with it. The Hadoop framework, an open source project based on the MapReduce paradigm, is a popular choice for distributed data warehouses and big data analytics. In this paper, we propose an original approach for partitioning and collocating data in distributed file systems, especially Hadoop-based systems, and this, ...

Building Data Warehouses Using the Enterprise Modeling Framework

This paper proposes an enterprise modeling framework for the deployment of data warehouses. The framework provides the information roadmap coordinating source data and different data warehouses across the business enterprise. The paper introduces a solution to address data warehousing issues at the enterprise level while avoiding the pitfalls of creating enterprise data warehouses and universal...

Automatic Workload Management for Enterprise Data Warehouses

Modern enterprise data warehouses have complex workloads that are notoriously difficult to manage. Additionally, RDBMSs have many “knobs” for managing workloads efficiently. These knobs affect the performance of query workloads in complex interrelated ways and require expert manual attention to change. It often takes a long time for a performance expert to get enough experience with a large war...

Data Mining for Intelligent Enterprise Resource Planning System

Enterprise Resource Planning or ERP is the practice of consolidating an enterprise’s planning, manufacturing, sales and marketing efforts into one management system. It attempts to integrate all departments and functions across a company onto a single computer system that can serve all those different departments' particular needs. This paper proposed an intelligent ERP system by integrating en...

Persistence in Enterprise Data Warehouses

Yet, persistence of redundant data in Data Warehouses is often simply justified with an achievement of better performance when accessing data for analysis and reporting. Especially in Enterprise Data Warehouse systems, data management via multiple persistence levels is necessary to condition the huge amount of data into an adequate format for its final usage. However, there are further reasons ...

Publication date: 2015
